3.1 References

Appendices
A. Index DTD

1. Introduction 

This document provides a complete description of the HTTP Distribution and Replication Protocol
(DRP). The goal of the DRP protocol is to significantly improve the efficiency and reliability of data
distribution over HTTP. Here is a more detailed list of goals:

• Provide a widely applicable mechanism for the efficient replication of data and content.
• Improve the efficiency and reliability of caching servers and proxies.
• Significantly improve the efficiency of subscription-based data and content distribution.
• Improve reliability and versioning for the distribution of mission-critical applications and data.
• Create an interoperability standard for replicating content between clients and servers from
different vendors.
• Keep the protocol simple to avoid incompatibilities.
• Provide functionality that is complementary to existing standards.
• Provide functionality that is backward compatible with existing HTTP servers and proxies.
• Provide functionality that can be deployed anywhere where HTTP is available today.

The HTTP Distribution and Replication Protocol was designed to efficiently replicate a hierarchical set
of files to a large number of clients. No assumption is made about the content or type of the files; they
are simply files in some hierarchical organization.

After the initial download, a client can keep the data up-to-date using the DRP protocol. Using DRP, the
client can download only the data that has changed since the last time it checked. Downloading only the 
differences is much more efficient because data typically evolves slowly over time, and because changes 
are usually restricted to only a subset of the data. One of the goals of the DRP protocol is to avoid 
downloading the same data more than once. 

The data that is distributed can consist of more than just HTML pages; it can consist of any kind of code
or content. The DRP protocol provides strong guarantees about data versioning. This is a requirement
for distributing mission-critical applications, because getting the correct version of each component is
necessary to ensure that a complete application will work correctly. Correct file versioning using existing
HTTP requests is problematic because there is currently no reliable mechanism for identifying a
particular version of a file.

The DRP protocol uses content identifiers to automatically share resources that are requested more than 
once. This eliminates redundant transfers of commonly used resources. The content identifiers used in 
the DRP protocol are based on widely accepted checksum technology. 

The DRP protocol uses a data structure called an index, which is currently specified using the Extensible
Markup Language (XML). Because the index describes meta data, we anticipate using the Resource
Description Framework (RDF) in a future version of the DRP protocol specification. XML is used in
the interim because the RDF standard was not finalized at the time of writing.

The DRP protocol relies on existing HTTP/1.1 functionality to achieve better performance when
replicating files, and it is backward compatible with existing HTTP/1.0 and HTTP/1.1 proxies so that it
can be deployed immediately. To further improve the download efficiency and caching behavior of
HTTP, the proposal introduces new HTTP header information that may become part of the core HTTP
protocol specification in the future. Because the DRP protocol is layered on HTTP, it will benefit from
ongoing efforts in the HTTP-NG working group.

The DRP protocol is based on technology which was originally developed, implemented and deployed
by Marimba Inc. Although it was originally designed for software distribution, it is widely applicable in
many different areas, and it enables the large scale automatic distribution of software, data, and content
to many different clients.

2. The HTTP Distribution and Replication Protocol 

2.1 Content Identifiers 



The DRP protocol uses content identifiers to identify data. A content identifier is
a token that can be used to uniquely identify any piece of data or content. It can also be used to
determine whether two pieces of content are identical with great accuracy.

A content identifier consists of one or more Uniform Resource Identifiers (URI) separated by commas.
The URIs are combined together into a single content identifier and form a globally unique identifier.
Here is the syntax of a content-identifier:

    content-identifier = URI *( "," URI )

Typically a checksum algorithm is used to generate a content identifier for a piece of content. One of the
checksum algorithms which can be used is the MD5 Message Digest algorithm from RSA. The MD5
algorithm is a well-known algorithm for computing a 128-bit checksum. See
http://www.rsa.com/pub/rfc1321.txt for details on the implementation of the MD5 algorithm.

The likelihood of two different files producing the same MD5 checksum is very small (about 1 in 2^64),
and as such, the MD5 checksum of a file can be used as a content identifier that is very
likely to uniquely identify the file. The reverse is also true. If two files have the same MD5 checksum, it
is very likely that the files are identical.

It is possible to define a Universal Resource Name (URN), which is a kind of URI, for a 128-bit MD5 
checksum: 



    MD5-URN = "urn:md5:" base64-number



Another checksum algorithm used in the DRP protocol is the SHA algorithm from the National Institute
of Standards and Technology. It is similar to the MD5 algorithm but its checksums are 160 bits long,
and have different properties. See http://csrc.nist.gov/fips/fip180-1.txt for details on the implementation
of the SHA algorithm.

Here is how a URN is created using a 160-bit SHA checksum:

    SHA-URN = "urn:sha:" base64-number



Both MD5 and SHA based URIs use base64 encoding to encode the checksum of the content. The
base64 algorithm encodes each 3 bytes of the checksum in 4 characters containing 6 bits of information
each. The details of the base64 encoding algorithm are part of the MIME specification.

Note that a base64 number can include "/" and "+" characters. Although these characters are treated
specially in the URN syntax specification, it is not necessary to escape them when used in a checksum
URN.
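
As a rough illustration of the two URN forms above, the following Python sketch derives MD5 and SHA content identifiers from a byte string. The function names are illustrative and not part of this proposal, and the sketch assumes that "SHA" refers to the 160-bit SHA-1 algorithm and that the standard base64 alphabet is used without escaping.

```python
import base64
import hashlib


def md5_urn(content: bytes) -> str:
    """Return an MD5-based URN of the form urn:md5:<base64 digest>."""
    digest = hashlib.md5(content).digest()      # 16 bytes (128 bits)
    return "urn:md5:" + base64.b64encode(digest).decode("ascii")


def sha_urn(content: bytes) -> str:
    """Return a SHA-based URN (assumed here to be SHA-1, 160 bits)."""
    digest = hashlib.sha1(content).digest()     # 20 bytes (160 bits)
    return "urn:sha:" + base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    print(md5_urn(b""))
    print(sha_urn(b""))
```

Because base64 maps every 3 bytes to 4 characters, a 16-byte MD5 checksum always encodes to 24 characters ending in "==", and a 20-byte SHA checksum to 28 characters ending in "=".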

If a content identifier contains a URI that refers to a well known checksum algorithm such as MD5 or 
SHA, it is possible to verify the integrity of the content. If none of the URIs in the content identifier refer 
to a known checksum algorithm, the content identifier should be treated as an opaque string that can be 
used for addressing purposes only. 
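
The distinction between verifiable and opaque identifiers could be handled along the following lines. This is only a sketch: the function name and the three-valued result are assumptions made for illustration, "sha" is again taken to mean SHA-1, and the `urn:x-version:` URI in the usage note is hypothetical.

```python
import base64
import hashlib
from typing import Optional

# Checksum URN prefixes this sketch knows how to verify (assumption:
# "sha" means SHA-1).
_ALGORITHMS = {"urn:md5:": hashlib.md5, "urn:sha:": hashlib.sha1}


def verify_content(content: bytes, content_id: str) -> Optional[bool]:
    """Check content against a comma-separated content identifier.

    Returns True or False when at least one URI names a known checksum
    algorithm, and None when the identifier is opaque to this sketch.
    """
    verdict = None
    for uri in content_id.split(","):
        uri = uri.strip()
        for prefix, algorithm in _ALGORITHMS.items():
            if uri.lower().startswith(prefix):
                expected = uri[len(prefix):]
                digest = algorithm(content).digest()
                actual = base64.b64encode(digest).decode("ascii")
                if actual != expected:
                    return False        # any mismatching checksum fails
                verdict = True
    return verdict
```

For example, `verify_content(data, "http://www.acme.com/a,urn:x-version:1")` returns None, which signals that the identifier can be used for addressing but not for integrity checking.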

Here are some examples of valid content identifiers: 



    urn:md5:…
    urn:md5:HUXZ…KDcJPcOA==
    http://www.acme.com/…
    http://www.acme.com/Examples/…

In applications where the possibility of duplicates for a given URN is unacceptable, content identifiers
can be generated with the required uniqueness by using multiple URIs. If for example
version numbers are used, it is important to further qualify the content identifier in order to make it
globally unique. This can be achieved by including the URL of the object to which the version number
applies in the content identifier.



A content identifier is often embedded in an HTTP header, and therefore the URIs in the content
identifier are not allowed to contain reserved characters such as spaces or comma characters. All
reserved characters should be encoded as specified in the URI specification.
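
One way to combine several URIs into a single header-safe content identifier is sketched below. The function name and the exact set of characters left unescaped are assumptions; the only requirement taken from the text above is that spaces and commas inside a URI are percent-encoded, so that the comma can serve as the separator.

```python
from urllib.parse import quote

# Characters left untouched. Space and "," are deliberately absent so that
# they are percent-encoded; "%" is kept so existing escapes are not doubled.
_SAFE = ":/?#[]@!$&'()*+;=%-._~"


def make_content_id(*uris: str) -> str:
    """Join one or more URIs into a single comma-separated identifier."""
    if not uris:
        raise ValueError("a content identifier needs at least one URI")
    return ",".join(quote(uri, safe=_SAFE) for uri in uris)
```

The "+", "/" and "=" characters of a base64 checksum pass through unchanged, so a checksum URN can be combined with a URL without further escaping.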

2.2 Index Format 

To describe the exact state of a set of data files, the DRP protocol uses a data structure called an index.
An index not only describes the hierarchical structure of the files, but it also describes the version, size, 
and type of each file. An index is a snapshot of the state of a set of files at a particular moment in time. 

An index is typically stored in memory as a tree data structure, but in order for clients and servers to
communicate this information over HTTP, the index can be described using XML. The index DTD used
by the DRP protocol is specified in Appendix A. Using this DTD, here is an example of an index that
describes a hierarchical set of files:

    <?XML VERSION="1.0" RMD="NONE"?>
    <index>
      <file path="home.html"
            size="12345"
            id="urn:md5:PEFjWBDv/sd9alS9BYuX0w=="
      />
      <file path="layer.js"
            size="3211"
            id="urn:md5:W25YCu3toJt32sDsHIZmpg=="
      />
      <dir path="images">
        <file path="acme.gif"
              size="4532"
              id="urn:md5:…"
        />
        <file path="banner.gif"
              size="10452"
              id="urn:md5:…"
        />
      </dir>
      <dir path="java/classes">
        <file path="Scroll.java"
              size="…"
              id="urn:…"
        />
        <file path="gui.jar"
              size="…"
              id="urn:md5:tcUzwODKut3SiTpmpAsi8g=="
        />
      </dir>
    </index>


2.3 Index Retrieval 



A DRP index is retrieved given a URL to the index. The index can be stored in any file and can be
retrieved using a normal HTTP GET request. The mime type of the index file allows clients to treat the
index as a special file which provides meta information about other files. The client can use the index to
automatically update its copy of the files that the index describes.

The index file can be an ordinary file on an HTTP server, but it can also be generated dynamically. Note
that the index isn't necessarily generated from a file system; it could also represent hierarchical data from
a different source such as a database.

Once the initial download is complete, the client can check for updates by retrieving a new version of
the index, and comparing it against the previous version of the index. Because each file entry in the
index has a content identifier, the client can determine which files have changed and so determine the
minimal set of files that need to be downloaded in order to bring the client up-to-date.
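
The comparison described above might be sketched as follows in Python. The helper names are illustrative only. The sketch assumes the index text is handed to the parser without its leading XML declaration, since that declaration form is unlikely to be accepted by a current XML parser, and it treats a file entry without an identifier as always needing a download.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def flatten_index(xml_text: str) -> Dict[str, Optional[str]]:
    """Map each file path in an index to its content identifier (or None)."""
    files: Dict[str, Optional[str]] = {}

    def walk(node: ET.Element, prefix: str) -> None:
        for child in node:
            path = prefix + child.get("path", "")
            if child.tag == "file":
                files[path] = child.get("id")
            elif child.tag == "dir":
                walk(child, path + "/")     # descend into the directory

    walk(ET.fromstring(xml_text), "")
    return files


def files_to_download(old: Dict[str, Optional[str]],
                      new: Dict[str, Optional[str]]) -> List[str]:
    """Return the paths whose content must be fetched to match `new`."""
    needed = []
    for path, new_id in new.items():
        # New files, changed identifiers and missing identifiers all count.
        if new_id is None or old.get(path) != new_id:
            needed.append(path)
    return sorted(needed)
```

Files that appear only in the old index are not returned here; a client would delete those rather than download them.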

Index Base URL 



The index file is typically located in the root directory of the file hierarchy it describes. In that case the
base URL of the files in the index is the same as the base URL of the index itself. The index file should
not be listed in the index itself.

An index can also explicitly specify a different absolute or relative base URL using the base attribute of 
the index tag. This allows an index to describe files that are located in a different directory, or even on a 
different server. 

For example, suppose the index is loaded from the following URL:

    http://www.acme.com/Examples/index.xml

Unless a different base URL is specified in the index tag, the base URL of the index will be:

    http://www.acme.com/Examples/

That means that the index in the example above describes the following files:

    http://www.acme.com/Examples/home.html
    http://www.acme.com/Examples/layer.js
    http://www.acme.com/Examples/images/acme.gif
    http://www.acme.com/Examples/images/banner.gif
    http://www.acme.com/Examples/java/classes/Scroll.java
    http://www.acme.com/Examples/java/classes/gui.jar
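
The base URL rules can be expressed with ordinary relative URL resolution. The sketch below uses Python's `urljoin`; the function names are illustrative, and treating an explicit base without a trailing slash as a directory is an assumption of this sketch rather than something the proposal states.

```python
from typing import Optional
from urllib.parse import urljoin


def index_base_url(index_url: str, base_attr: Optional[str] = None) -> str:
    """Resolve the base URL for the files described by an index."""
    if base_attr is None:
        # Default: the directory that contains the index file itself.
        return urljoin(index_url, ".")
    base = urljoin(index_url, base_attr)    # relative or absolute override
    return base if base.endswith("/") else base + "/"


def file_url(base_url: str, path: str) -> str:
    """Resolve a file path from the index against the base URL."""
    return urljoin(base_url, path)
```

With the index URL from the example, `index_base_url` yields the Examples directory, and `file_url` then produces the file URLs listed above.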



Index Expiration 

In the absence of any meta data, the client should observe the Expires field in the HTTP reply of the 
index request to determine how long the index is valid. When the index expires, a new version of the 
index should be downloaded. 

This provides a simple mechanism for scheduling updates of indexes. In some cases additional meta data 
will be available which can provide more detailed information for scheduling updates of the index and 
the content it describes. 
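
A client could derive its next polling time from the Expires field roughly as follows. The function name is illustrative; treating an unparseable date as already expired is an assumption of this sketch, chosen so that a malformed header leads to a fresh download rather than a stale index.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def seconds_until_expiry(expires_header: str,
                         now: Optional[datetime] = None) -> float:
    """Return how long a cached index stays valid (0.0 if already expired)."""
    now = now or datetime.now(timezone.utc)
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return 0.0                      # unparseable dates count as expired
    if expires.tzinfo is None:
        expires = expires.replace(tzinfo=timezone.utc)
    return max(0.0, (expires - now).total_seconds())
```

A result of 0.0 means that a new version of the index should be requested immediately.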

Index Caching 



An HTTP proxy may cache indexes if the HTTP header indicates that this is allowed. An HTTP proxy
should treat indexes just like normal files. The HTTP/1.0 and HTTP/1.1 specifications provide a variety
of cache control mechanisms that can be used to control the caching of indexes.

A server which generates personalized indexes that all have the same base URL will have to mark the
index as not cacheable using the standard HTTP mechanisms as defined in the HTTP/1.0 and HTTP/1.1
protocol specifications.

2.4 Content Based Addressing 

By requesting an index file it is possible for a client to obtain a complete description of the structure and 
state of a set of data files. Given an index, it is possible for a client to determine exactly which files need 
to be downloaded, as well as the total size of the download. 

Because the client can determine the exact set of files that need to be downloaded, it is possible to issue
multi-file get requests to get all the files at once. Various HTTP extensions proposed by the HTTP-NG
working group, and as part of the HTTP Pipelining proposal, can be used to make the download of
multiple files more efficient.



Note that a client can use a disk cache for data files that is indexed by content identifier. This
means that if multiple indexes refer to a file with the same content identifier, the client can automatically
detect that the file is already in the cache, and thus avoid downloading the file a second time. This is not
uncommon because different sites often refer to the same standard libraries or images, and because their
content identifiers match, multiple redundant downloads can be avoided. Eliminating redundant
downloads can reduce the overhead of downloading commonly used data and software components.
Using content identifiers, this works even when the same content is requested from
different hosts.

An index is not required to contain a content identifier for each file, and the content identifier can be 
omitted for any or all of the files in the index. A server can decide to omit the content identifier for files 
that are dynamically generated, such as a live video picture. If no content identifier is present the file 
should be retrieved using a normal get-if-modified request. When comparing indexes, files without a 
content identifier should always be considered to be different. 
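
A content-addressed client cache of this kind might look like the sketch below. The class name, the in-memory dictionary standing in for a disk cache, and the `download` callback are all assumptions made for illustration. Entries without a content identifier are never cached, in line with the rule that such files are always considered different.

```python
from typing import Callable, Dict, Optional


class ContentCache:
    """In-memory stand-in for a disk cache keyed by content identifier."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def fetch(self, url: str, content_id: Optional[str],
              download: Callable[[str], bytes]) -> bytes:
        """Return the content for `url`, downloading only on a cache miss."""
        if content_id is not None and content_id in self._store:
            return self._store[content_id]      # same content, any host
        data = download(url)
        if content_id is not None:
            self._store[content_id] = data
        return data
```

Because the cache key is the content identifier rather than the URL, a library that two different sites both reference is downloaded only once.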

Content-ID Header Field

Now that it is possible to obtain an index for a large set of files, a mechanism is needed to obtain the 
correct version of each of the files that need downloading. Note that the correct version of each file is 
determined by its content identifier as specified in the index. It is very important for the distribution of 
applications that the correct version of each file is obtained, rather than just the current version. 

When requesting a file, the client can include the content identifier of the desired version in the request
to the server. For example:

    GET /Example/home.html HTTP/1.1
    Content-ID: urn:md5:PEFjWBDv/sd9alS9BYuX0w==



A new HTTP header field called Content-ID is used to specify the correct version of the file that is
requested. The server can use the content identifier in the Content-ID field to determine if the
requested version of the file can be delivered to the client.

The content identifier of the file that is returned is specified in the reply using the
Content-ID header field. The content identifier in the reply must match the requested content identifier.
If the correct version of the file is not available, the server should respond with the error: "404 File
Version Not Found".

If no content identifier is specified in the HTTP GET request, then the server should return the current
version of the file, as is the case in a normal HTTP request. However, the reply should always include
the Content-ID field if the correct content identifier is known for the file that is returned.
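
The server-side rules for the Content-ID field can be summarized in a small function. This is a sketch under stated assumptions: the in-memory `store` layout (a list of versions per path, oldest first) and the function name are hypothetical, and a real server would look versions up in its own file cache.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store: path -> list of (content_id, body), oldest first,
# so the last entry is the current version.
Store = Dict[str, List[Tuple[str, bytes]]]


def handle_get(store: Store, path: str,
               content_id: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Answer a GET that may carry a Content-ID request header."""
    versions = store.get(path)
    if not versions:
        return 404, {}, b"Not Found"
    if content_id is None:
        current_id, body = versions[-1]         # plain GET: current version
        return 200, {"Content-ID": current_id}, body
    for known_id, body in versions:
        if known_id == content_id:              # exact version requested
            return 200, {"Content-ID": known_id}, body
    return 404, {}, b"File Version Not Found"
```

The reply always carries the identifier of the body that is actually returned, which is what allows the client to check that it received the version it asked for.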

A server that is unaware of the Content-ID field will always reply with the current version of the
requested file, regardless of the content identifier specified in the request. Note that the server is not
required to specify a Content-ID in the reply, and that it is the responsibility of the client to verify that
the reply contains the correct content identifier if a Content-ID field is present in the reply. When the
requested content identifier contains a checksum URN, the client can also recompute the
checksum to verify that the correct content was transmitted.

2.5 Differential Downloads

When a file is updated on the server, it will be downloaded by each of the clients that needs the new 
version. Updates to files very often affect only small portions of the file, and it would be much more 
efficient if the server could reply with only the parts of the file that have changed. This can be achieved 
using a differential GET request. 

Differential-ID Header Field 

A client can use the index obtained from a server to determine what changes have occurred in a set of 
files. It is also possible to detect which files were modified, rather than created or deleted. 
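
Separating modified files from created and deleted ones is a plain comparison of two indexes. The sketch below assumes each index has already been flattened into a mapping from file path to content identifier; the names are illustrative.

```python
from typing import Dict, List, NamedTuple, Optional


class Changes(NamedTuple):
    created: List[str]
    modified: List[str]
    deleted: List[str]


def classify_changes(old: Dict[str, Optional[str]],
                     new: Dict[str, Optional[str]]) -> Changes:
    """Compare two flattened indexes (path -> content identifier)."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(
        p for p in new
        if p in old and (new[p] is None or new[p] != old[p])
    )
    return Changes(created, modified, deleted)
```

Only the modified entries that carry an identifier in both indexes are candidates for a differential request; created files have no earlier version to diff against.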



When a file is modified, the client can issue a differential GET request for the file, which includes not
only the content identifier of the desired version of the file, but also the content identifier of the current
version of the file on the client.

In a differential GET request the content identifier of the file that the client currently possesses is 
specified using the Differential-ID header field. For example: 







    GET /Example/… HTTP/1.1
    Content-ID: urn:md5:…
    Differential-ID: …







When the server receives a get request that includes a Differential-ID field, it can look in its file cache
for both versions of the requested file using the content identifiers specified by the Content-ID field and
the Differential-ID field. If both versions of the file are found, the server can compute the difference
between the two files and return the diff rather than the entire file.

When a server replies with a differential update, it must include both the content identifier of the 
resulting file in the Content-ID field, as well as the content identifier of the file to which the update 
should be applied in the Differential-ID field. 

If the server does not have access to the version of the file that is indicated by the differential content
identifier, it can ignore the differential content identifier, and return the entire requested file. Also, if the
diff is not smaller, the server can decide to send the entire file instead.



A client can determine from the reply whether it contains a diff or the entire file. After
verifying the Content-ID and Differential-ID fields in the reply, the client can apply the diff to its current
version of the file to generate the desired version of the file.
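
The server's choice between a diff and the full file, and the client's application of the diff, might be sketched as follows. The diff encoding here is a toy format invented for this example (shared prefix and suffix lengths followed by the replaced middle) and is not the GDIFF format; the function names and the `versions` mapping are likewise hypothetical.

```python
import struct
from typing import Dict, Optional, Tuple


def make_diff(old: bytes, new: bytes) -> bytes:
    """Toy diff (NOT GDIFF): shared prefix/suffix lengths plus new middle."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    middle = new[prefix:len(new) - suffix]
    return struct.pack(">II", prefix, suffix) + middle


def apply_diff(old: bytes, diff: bytes) -> bytes:
    """Rebuild the new version from the old version and a toy diff."""
    prefix, suffix = struct.unpack(">II", diff[:8])
    return old[:prefix] + diff[8:] + old[len(old) - suffix:]


def differential_reply(versions: Dict[str, bytes], want_id: str,
                       have_id: Optional[str]) -> Tuple[Dict[str, str], bytes]:
    """Return (reply headers, body): a diff when possible, else the file."""
    new = versions[want_id]
    old = versions.get(have_id) if have_id else None
    if old is not None:
        diff = make_diff(old, new)
        if len(diff) < len(new):                # only send a smaller diff
            return {"Content-ID": want_id, "Differential-ID": have_id}, diff
    return {"Content-ID": want_id}, new
```

In this sketch the presence of a Differential-ID field in the reply headers is what tells the caller that the body is a diff rather than the complete file.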

To ensure interoperability it is necessary to specify a well known format which can be used to describe
the differences between two versions of any file. The default diff format is the GDIFF format which is
defined in a separate proposal. The client can specify additional acceptable differencing formats using
the Accept header field.

A simple server can choose to ignore differential get requests altogether, and simply reply with the entire 
contents of the requested file. 

Note that although the example uses the Content-ID field to specify the desired version of a file, this
field can be omitted in order to obtain a differential update to the most recent version. The server should
return a "304 Not Modified" reply if the requested resource has not changed.



Here is a table that lists all the possible combinations and successful replies that can occur when 
specifying the Content-ID and Differential-ID in the HTTP request: 



Request     Request                                                      Reply       Reply
Content-ID  Differential-ID  Action                                      Content-ID  Differential-ID
----------  ---------------  ------------------------------------------  ----------  ---------------
no          no               Return the current version of the file.     yes         no
yes         no               Return the correct version of the file as   yes         no
                             indicated by the Content-ID.
no          yes              Return the diff between the current         yes         yes
                             version and the version identified by the
                             Differential-ID.
yes         yes              Return the diff between the version         yes         yes
                             identified by the Content-ID and the
                             version identified by the Differential-ID.



Differential Index Retrieval 



It is very important that obtaining the most up-to-date index is an efficient operation because clients will
poll for updates repeatedly. It would be wasteful to send the entire index each time an index request is
made, especially since there will often be little or no change since the last request. Also note that indexes
can contain many files, and as a result they can be rather large.

To avoid downloading the complete index repeatedly, a client can use a differential get request to obtain 
an index. In response to a differential index request the server can reply with the diff between the client's 
index and the server's current index. 

The server decides what content identifier to use for each index. Each content identifier must uniquely 
identify the index. One way of doing this would be to construct an identifier out of a checksum 
computed on the index; another way would be by combining the URL of the index with a version 
number. 

Note that a differencing algorithm can be defined that performs a structural comparison of two indexes, 
rather than a textual comparison. 



2.6 HTTP Proxy Caching

An HTTP proxy can be made aware of the Content-ID and Differential-ID fields in HTTP requests and 
replies. Because the content identifier is included in each GET request, the proxy can avoid accidentally 
returning the wrong version of a requested file. Getting the wrong version from an intermediate proxy 
often presents a serious problem when distributing mission critical applications and data. 

The proxy can use the content identifier field to uniquely identify the content that is being transferred.
The same piece of content, even when downloaded from multiple locations, is likely to have the same
content identifier. The proxy can use this fact to avoid multiple redundant downloads.

In addition, a proxy can use the Differential-ID header field to reply to differential GET requests. If both 
versions of the file are in the proxy's cache, the proxy can compute the differential reply. 



2.7 Backward Compatibility

The DRP protocol works efficiently in existing HTTP enabled environments. Some special precautions
need to be taken to make sure that older proxy servers don't interfere with the versioning strategy used
by the DRP protocol.

HTTP/1.0 Proxy Caching 

To avoid the incorrect caching of data files in HTTP/1.0 proxies, the server should mark the files
returned to an HTTP/1.0 client or proxy as not cacheable. This can be done using the Pragma and/or
Expires fields, as defined in the HTTP/1.0 specification.

HTTP/1.1 Proxy Caching 

To correctly cache files in an HTTP/1.1 proxy, the server can use the Vary field in the HTTP reply,
which should be set to "Content-ID". Here is an example of a reply which includes a Content-ID:

    HTTP/1.1 200 OK
    Content-ID: urn:md5:…

If the request also contains a Differential-ID, the Vary header field should be set to
"Content-ID, Differential-ID". Here is an example of a reply to a differential GET request:

    Vary: Content-ID, Differential-ID
    Content-ID: …
    Differential-ID: …

Using this strategy, client requests that specify a Content-ID will be cached appropriately
by an HTTP/1.1 proxy. However, when the Content-ID is omitted, the proxy will not match the request
and will contact the server each time.

Using the Vary header this way provides the best performance over HTTP/1.1 without weakening any of
the caching guarantees required by the DRP protocol.
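
The effect of the Vary field on proxy lookups can be illustrated with a simplified cache key. This is a rough model written for this example, not a description of how any particular proxy implements HTTP/1.1 caching, and the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, Tuple[Tuple[str, Optional[str]], ...]]


def cache_key(url: str, request_headers: Dict[str, str], vary: str) -> Key:
    """Build a proxy cache key from the URL and the headers named in Vary."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    selected = []
    for name in vary.split(","):
        name = name.strip().lower()
        if name:
            selected.append((name, lowered.get(name)))  # None if absent
    return url, tuple(sorted(selected, key=lambda item: item[0]))
```

Two requests for the same URL with different Content-ID values produce different keys, so a cached copy of one version is never returned in place of another.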

Compatibility with Existing Servers 

The DRP protocol is defined so that it does not necessarily require any server support. The index can be
specified using a normal file on a server that describes part or all of the site. It is easy to generate an
index file using a simple program that scans the file system.
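
Such a scanning program might look like the following Python sketch. It is not a reference implementation: the function names are illustrative, every file is given an MD5-based identifier, the mime and info attributes are omitted, and the output does not include the leading XML declaration, which would have to be prepended before the index is published.

```python
import base64
import hashlib
import os
import xml.etree.ElementTree as ET


def _describe(directory: str, parent: ET.Element, skip: str) -> None:
    """Add file and dir elements for everything below `directory`."""
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if os.path.isdir(full):
            _describe(full, ET.SubElement(parent, "dir", path=name), skip)
        elif os.path.isfile(full) and os.path.abspath(full) != skip:
            with open(full, "rb") as handle:
                data = handle.read()
            digest = hashlib.md5(data).digest()
            urn = "urn:md5:" + base64.b64encode(digest).decode("ascii")
            ET.SubElement(parent, "file", path=name,
                          size=str(len(data)), id=urn)


def build_index(root: str, index_name: str = "index.xml") -> str:
    """Return index XML for the tree at `root`, omitting the index itself."""
    index = ET.Element("index")
    skip = os.path.abspath(os.path.join(root, index_name))
    _describe(root, index, skip)
    return ET.tostring(index, encoding="unicode")
```

The index file is skipped while scanning because an index should not list itself.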

The client can use the index file to determine which files to download from the site using normal HTTP
requests. If the server does not provide special DRP support, no guarantees can be given about file
versioning. However, if the index file specifies content identifiers for each file, the client can use these
to verify that it has obtained the correct version by recomputing the appropriate checksums.

Note that in this manner DRP can also be used with a number of existing URL types such as "ftp:" and
"file:".

3. Conclusion 

The HTTP Distribution and Replication Protocol provides significant added value to the HTTP protocol. 
It enables the distribution and replication of data files in a simple and efficient manner. DRP is a widely 
applicable technology and is appropriate for the distribution of HTML-based content as well as mission 
critical applications. 

The DRP defines the following new features: 

• Content identifiers, using the existing URI specification, which can uniquely identify a piece of
content.

• An index format which can be used to describe a set of files.

• A new HTTP header field, Content-ID, which is used to obtain the correct version of a file by 
specifying a content identifier. 

• A new HTTP header field, Differential-ID, which is used to obtain a differential update for a file.

Appendices

A. Index DTD

Below is the XML Document Type Definition used in this document to describe an index:

Notes:

• A valid index should start with the following XML declaration:

    <?XML VERSION="1.0" RMD="NONE"?>

• The mime type of an index is application/drp-index.

• An index describes a hierarchical set of resources. The dir element represents a directory and can
contain further elements, which describe its content.

• The id attribute is a content identifier as defined in this proposal.

• The base attribute of the index element is used to override the default base URL of the index. The
URL can be relative or absolute.

• The path attribute of the file and dir elements is a path relative to the parent directory of the node.

• The size attribute of the file element is the size of the file in bytes.

• The mime attribute of the file element specifies the mime type of the file.

• The info attribute of the file element can be used to specify further attributes of the file such as
whether it is executable or not.

The HTTP Distribution and Replication Protocol http://www.w3.org/TR/NOTE-drp-19970825.html